The 2019 Masters Tournament
Every year, professional golfers from the Men’s PGA Tour are invited to compete in a major championship called The Masters Tournament at Augusta National Golf Club in Augusta, Georgia. This tournament is hosted at the same location every year and is one of four major championships in the PGA Tour.
In this report, I explore variables from the 2019 Masters Tournament to determine whether any of them are related. I address this question by gathering data from the PGA Tour website and then exploring it with descriptive statistics and graphs in RStudio. Throughout this exploratory data analysis, I use univariate and bivariate techniques to determine which variables are related in the PGA Tour Masters Tournament.
The following packages are being used in this report:
PGA Tour Logo
The data for this project was collected from the statistics section of the PGA Tour website. Wikipedia and Google were also used to help calculate a variable in the dataset known as Age. On the PGA Tour website, I looked at multiple statistics and manually transferred them to an Excel spreadsheet. Once all of the data was entered into Excel, I saved the file as a .csv file, allowing me to import the dataset into RStudio.
The main link I used was https://www.pgatour.com/stats.html, which led me to the following pages, from which I transferred the data:
Stats for Placement, Rounds, Total Strokes:
Stats for Avg. Driving Distance, Total Distance, Total Drives:
Stats for Fairways Hit, Possible Fairways, Driving Accuracy:
Stats for Greens Hit, Number of Holes, and Greens in Regulation Percentage:
Stats for Number of 1 putts, Number of Holes, and 1 putt making percentage:
Stats for Total 3 putts, Total Rounds, and Average 3 putt per round:
Stats for Total 2 putts, Total Rounds, and Average 2 putt per round:
Stats for Total Putts, Total Rounds, Low total Putts, Average Putts per Round:
Stats for Birdie to Bogey Ratio, Total Birdies and Better, Total Bogeys and Worse:
Stats for Scoring Average Before the Cut (Total Strokes, Total Rounds, and Average Strokes per Round before the cut):
Stats for Par 3 Scoring Average (Total Strokes, Total Holes, Average Strokes per Par 3):
Stats for Par 4 Scoring Average (Total Strokes, Total Holes, Average Strokes per Par 4):
Stats for Par 5 Scoring Average (Total Strokes, Total Holes, Average Strokes per Par 5):
Stats for Rounds, Driving Distance, and Total Drives for all tournaments before the Masters for the year:
Stats for Rounds, Driving Accuracy Percentage, Fairways Hit, and Possible Fairways to hit for all tournaments before the Masters for the year:
Stats for Greens in Regulation Percentage, Greens Hit, and Number of Holes for all tournaments before the Masters for the year:
The dataset used in this project was assembled to analyze which variables, if any, are related in the PGA Tour 2019 Masters Tournament. All of this data was originally recorded by the PGA Tour ShotLink System, which receives information in real time and transfers it to the PGA Tour website. More on this system can be found at http://www.shotlink.com/about/history. Since the system transfers the information in real time, I believe the data was collected from April 11th through April 14th, 2019. However, it was not until September and October of 2020 that I transferred the data to the Excel sheet used for this report.
The original dataset contains 51 variables, each with 65 observations. These observations come from the 65 PGA Tour players who made the cut and therefore played all four rounds, so each observation represents one player’s statistics from the 2019 Masters Tournament.
Since I entered the data into Excel myself, not much cleaning was needed. However, there were a few missing values, which arose because a few players had no reported observations for certain variables. For example, for the variable Total.3.Putts, a few players never needed three putts on a hole during the entire tournament. To handle these missing values, I used common R functions to change them to 0. This was the case for a few variables in the dataset.
For some other variables, a different procedure was needed. For example, the variable Rounds.Y.D gives the number of professional rounds of golf a player had played during the year up to the tournament. If a player had not yet played a professional round before the tournament, no observation was recorded, which produced a missing value. These missing values could have been changed to 0; however, after looking into why they were missing, I changed them to the mean of the variable (computed with the missing values removed). Some missing values were handled in one more way, which is described in the data cleaning section, where the reasoning behind the modification is easier to follow.
In the original dataset, certain variables were not needed because they conveyed the same information as another variable or did not contribute necessary information to the overall dataset.
The following steps and code will show the cleaning that was done, while also giving reasoning behind the cleaning of the data:
The following code imports the initial data set:
#Read in the .csv file dataset
Masters2019 <- read.csv("~/Documents/Data Science/DSA8030/PGA Masters Test.csv")
After the dataset is imported, the following code will take a quick glimpse of the dataset that we are working with:
Note: The results are hidden for the purposes of saving space.
#Quick look at the data's variables and a quick look at descriptive statistics for the observations
names(Masters2019)
summary(Masters2019)
The following code will take a closer look at a few variables in the initial dataset before cleaning:
Note: These results are hidden and were used in the initial analysis to get a glimpse of their observations to determine if they should be kept in the final dataset.
# Look at these variables to see if we may need them/what variables have observations that are NA
Masters2019$Name
Masters2019$Rounds
Masters2019$Total.Drives
Masters2019$Possible.Fairways
Masters2019$Num..of.Holes
Masters2019$Total.3.Putts
Masters2019$Avg..3.Putts
Masters2019$Rounds.BC
Masters2019$Total.Par.3.Holes
Masters2019$Total.Par.4.Holes
Masters2019$Total.Par.5.Holes
Masters2019$Rounds.Y.D
Masters2019$Avg.Driving.Distance.Y.D
Masters2019$Driving.Distance.Y.D
Masters2019$Total.Drives.Y.D
Masters2019$Driving.Accuracy.Y.D
Masters2019$Fairways.Hit.Y.D
Masters2019$Possible.Fairways.Y.D
Masters2019$Greens.Hit.Y.D
Masters2019$Num.of.Holes.Y.D
Masters2019$Greens.in.Reg.Y.D
The following code finds the column positions of the variables that will be deleted, either because they do not give useful information or because they repeat information that other variables display:
Note: These results are hidden but are shown in the comment next to the function.
# Variables to be removed from the dataset because they are not needed for the report
which(colnames(Masters2019)=="Name") #2
which(colnames(Masters2019)=="Rounds") #5
which(colnames(Masters2019)=="Total.Drives") #10
which(colnames(Masters2019)=="Possible.Fairways") #12
which(colnames(Masters2019)=="Num..of.Holes") #15
which(colnames(Masters2019)=="Rounds.BC") #31
which(colnames(Masters2019)=="Total.Par.3.Holes") #34
which(colnames(Masters2019)=="Total.Par.4.Holes") #37
which(colnames(Masters2019)=="Total.Par.5.Holes") #40
which(colnames(Masters2019)=="Fairways.Hit.Y.D") #46
which(colnames(Masters2019)=="Possible.Fairways.Y.D") #47
which(colnames(Masters2019)=="Greens.Hit.Y.D") #48
which(colnames(Masters2019)=="Num.of.Holes.Y.D") #49
# Variables that repeat information thus will be removed from the dataset
which(colnames(Masters2019)=="Total.Strokes") #6
which(colnames(Masters2019)=="Total.Distance") #9
which(colnames(Masters2019)=="Fairways.Hit") #11
which(colnames(Masters2019)=="Greens.Hit") #14
which(colnames(Masters2019)=="Total.Strokes.Before.Cut") #30
which(colnames(Masters2019)=="Par.3.Total.Strokes") #33
which(colnames(Masters2019)=="Par.4.Total.Strokes") #36
which(colnames(Masters2019)=="Par.5.Total.Strokes") #39
which(colnames(Masters2019)=="Driving.Distance.Y.D") #43
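The repeated which() lookups above can also be done in a single call. The following is a minimal sketch, not the code used for this report: it applies base R's match() to a small made-up data frame (toy, with hypothetical values) whose column names are borrowed from the dataset.

```r
# Toy data frame standing in for Masters2019 (values are made up for illustration)
toy <- data.frame(
  Name = c("A", "B"),
  Age = c(30, 41),
  Rounds = c(4, 4),
  Total.Strokes = c(275, 280)
)

# Columns whose positions we want to look up
to_find <- c("Name", "Rounds", "Total.Strokes")

# match() returns the position of each requested name within names(toy)
positions <- match(to_find, names(toy))
names(positions) <- to_find
positions
```

A name that is not present comes back as NA, which makes a misspelled column name easy to spot.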
The following code creates a new, condensed, and more informative dataset called Masters by removing the unnecessary variables (shown above) from the original dataset:
Note: This code will not give an output but will create the data frame that is needed for the report.
# Creating the new dataset
Masters <- Masters2019
# Importing the package "dplyr" to remove the variables
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Remove the unnecessary variables from the original data set
Masters <- select(Masters,-c(1,2,5,6,9,10,11,12,14,15,30,31,33,34,36,37,39,40,43,46,47,48,49,51))
names(Masters)
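Selecting columns by position works, but the indices silently point at different variables if the spreadsheet layout ever changes. As a hedged alternative, the sketch below drops columns by name using base R. The data frame toy and its values are made up; only the column names come from the dataset.

```r
# Toy data frame standing in for the original dataset (made-up values)
toy <- data.frame(
  Name = c("A", "B"),
  Age = c(30, 41),
  Rounds = c(4, 4),
  Avg..Strokes.per.Round = c(68.75, 70.00)
)

# Names of the columns to remove
drop_cols <- c("Name", "Rounds")

# Keep every column whose name is not in drop_cols
toy_small <- toy[, setdiff(names(toy), drop_cols), drop = FALSE]
names(toy_small)
```

Using drop = FALSE keeps the result a data frame even when only one column remains.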
The following code will take a quick look at the new dataset to see if any cleaning needs to be done:
Note: The output is hidden to save space.
#All of the variables in the new dataset
names(Masters)
A new package, skimr, is introduced here. Its skim function gives a compact overview of a dataset: the number of rows and columns, the variable types, the number of missing values, the mean, and more.
Note: the column p0 is the minimum value, p25 is the lower quartile, p50 is the median, p75 is the upper quartile, and p100 is the maximum value.
# Importing the package "skimr" which shows a unique/neat way of summarizing all of the variables
library(skimr)
# A quick look at the characteristics of each of the variables
skim(Masters)
| Name | Masters |
| Number of rows | 65 |
| Number of columns | 27 |
| _______________________ | |
| Column type frequency: | |
| numeric | 27 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Age | 0 | 1.00 | 31.68 | 7.47 | 19.00 | 27.00 | 30.00 | 36.00 | 61.00 | ▆▇▅▁▁ |
| Placement | 0 | 1.00 | 31.37 | 19.03 | 1.00 | 17.00 | 32.00 | 49.00 | 62.00 | ▇▆▇▃▇ |
| Avg..Strokes.per.Round | 0 | 1.00 | 71.25 | 1.40 | 68.75 | 70.25 | 71.25 | 72.25 | 74.00 | ▅▇▇▅▅ |
| Avg..Driving.Distance | 0 | 1.00 | 298.23 | 8.49 | 272.60 | 292.60 | 297.60 | 302.50 | 316.30 | ▁▃▇▅▃ |
| Driving.Accuracy | 0 | 1.00 | 68.38 | 7.73 | 50.00 | 62.50 | 69.64 | 73.21 | 85.71 | ▂▅▆▇▂ |
| Greens.in.Reg..Perc. | 0 | 1.00 | 66.15 | 5.39 | 55.56 | 62.50 | 66.67 | 69.44 | 80.56 | ▃▇▆▅▁ |
| Num..of.1.Putts | 0 | 1.00 | 27.85 | 4.28 | 18.00 | 25.00 | 28.00 | 31.00 | 37.00 | ▂▅▇▆▂ |
| X1.Putt.Perc. | 0 | 1.00 | 38.68 | 5.94 | 25.00 | 34.72 | 38.89 | 43.06 | 51.39 | ▂▅▇▆▂ |
| Avg..2.Putt | 0 | 1.00 | 10.21 | 1.00 | 8.25 | 9.50 | 10.25 | 11.00 | 13.00 | ▃▇▇▅▁ |
| Total.2.Putts | 0 | 1.00 | 40.83 | 3.99 | 33.00 | 38.00 | 41.00 | 44.00 | 52.00 | ▃▇▇▅▁ |
| Total.3.Putts | 5 | 0.92 | 2.78 | 1.58 | 1.00 | 1.00 | 3.00 | 3.25 | 7.00 | ▇▅▂▂▁ |
| Avg..3.Putts | 5 | 0.92 | 0.70 | 0.40 | 0.25 | 0.25 | 0.75 | 0.81 | 1.75 | ▇▅▂▂▁ |
| Total.Putts | 0 | 1.00 | 117.28 | 5.05 | 105.00 | 114.00 | 116.00 | 121.00 | 131.00 | ▁▆▇▅▁ |
| Low.Total.Putts | 0 | 1.00 | 26.80 | 1.78 | 24.00 | 25.00 | 26.00 | 28.00 | 31.00 | ▇▇▇▃▂ |
| Avg..Putts | 0 | 1.00 | 29.32 | 1.26 | 26.25 | 28.50 | 29.00 | 30.25 | 32.75 | ▁▆▇▅▁ |
| Birdie.to.Bogey | 0 | 1.00 | 1.66 | 1.81 | 0.65 | 0.94 | 1.31 | 1.70 | 15.00 | ▇▁▁▁▁ |
| Total.Birdies.or.Better | 0 | 1.00 | 15.92 | 2.89 | 9.00 | 14.00 | 16.00 | 18.00 | 25.00 | ▂▇▇▂▁ |
| Total.Bogeys.or.Worse | 0 | 1.00 | 12.23 | 3.85 | 1.00 | 10.00 | 13.00 | 15.00 | 21.00 | ▁▅▇▇▂ |
| Scoring.Avg..Before.Cut | 0 | 1.00 | 71.38 | 1.55 | 68.50 | 70.50 | 71.50 | 72.50 | 73.50 | ▆▆▆▇▇ |
| Par.3.Scoring.Avg. | 0 | 1.00 | 3.02 | 0.14 | 2.69 | 2.94 | 3.00 | 3.13 | 3.31 | ▂▂▇▂▃ |
| Par.4.Scoring.Avg. | 0 | 1.00 | 4.10 | 0.10 | 3.93 | 4.03 | 4.10 | 4.18 | 4.38 | ▅▇▇▁▁ |
| Par.5.Scoring.Avg. | 0 | 1.00 | 4.54 | 0.19 | 4.19 | 4.38 | 4.50 | 4.69 | 4.94 | ▅▆▇▅▂ |
| Rounds.Y.D | 8 | 0.88 | 38.67 | 8.10 | 21.00 | 33.00 | 40.00 | 43.00 | 53.00 | ▂▃▇▇▃ |
| Avg.Driving.Distance.Y.D | 8 | 0.88 | 298.38 | 7.37 | 282.50 | 293.90 | 298.30 | 303.30 | 312.60 | ▂▅▇▆▃ |
| Total.Drives.Y.D | 8 | 0.88 | 60.18 | 15.80 | 28.00 | 48.00 | 64.00 | 70.00 | 88.00 | ▅▆▆▇▃ |
| Driving.Accuracy.Y.D | 8 | 0.88 | 61.89 | 4.91 | 49.26 | 58.93 | 61.79 | 65.06 | 71.73 | ▁▃▇▅▃ |
| Greens.in.Reg.Y.D | 8 | 0.88 | 68.06 | 3.32 | 60.94 | 65.58 | 68.10 | 70.05 | 75.76 | ▂▇▇▅▂ |
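The p0 through p100 columns in the summary above are percentiles. As a small illustration on a made-up vector (not data from the report), base R's quantile() can be asked for the same five positions:

```r
# Made-up values for illustration only
x <- c(19, 25, 27, 30, 33, 36, 61)

# Minimum, lower quartile, median, upper quartile, maximum
q <- quantile(x, probs = c(0, 0.25, 0.50, 0.75, 1))
q
```

The result is a named vector whose names (0%, 25%, 50%, 75%, 100%) line up with p0, p25, p50, p75, and p100.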
After looking at the data, some missing values need to be dealt with:
Note: The code below shows how the missing values are handled, and the reasoning is explained after the code. These results are also hidden.
# Change the NA values to 0 because NA meant they had no 3 Putts throughout the entirety of the Tournament
Masters$Total.3.Putts[is.na(Masters$Total.3.Putts)] <- 0
Masters$Total.3.Putts
Masters$Avg..3.Putts[is.na(Masters$Avg..3.Putts)] <- 0
Masters$Avg..3.Putts
# Change NA values to mean or a different tournament stat
Masters$Rounds.Y.D[is.na(Masters$Rounds.Y.D)] <- round(mean(Masters2019$Rounds.Y.D, na.rm = TRUE))
Masters$Rounds.Y.D
Masters$Total.Drives.Y.D[is.na(Masters$Total.Drives.Y.D)] <- mean(Masters2019$Total.Drives.Y.D, na.rm = TRUE)
Masters$Total.Drives.Y.D
# User defined function that facilitates the process of changing NAs to another value from the data set
filter_NAs <- function(x, y) {
  # Looks for NA values and replaces these values with the value in the
  # same row of another variable
  # Arguments:
  #   x : a variable in a data frame
  #   y : a variable in a data frame
  # Example: x <- c(rep(NA, 5))
  #          y <- c(seq(1:5))
  #          data.frame(x, y)
  #          x <- filter_NAs(x, y)
  #          x ---> [1] 1 2 3 4 5
  for (i in which(is.na(x))) {
    x[i] <- y[i]
  }
  x
}
Masters$Avg.Driving.Distance.Y.D <- filter_NAs(Masters$Avg.Driving.Distance.Y.D, Masters$Avg..Driving.Distance)
Masters$Driving.Accuracy.Y.D <- filter_NAs(Masters$Driving.Accuracy.Y.D, Masters$Driving.Accuracy)
Masters$Greens.in.Reg.Y.D <- filter_NAs(Masters$Greens.in.Reg.Y.D, Masters$Greens.in.Reg..Perc.)
#Check to see if any NAs still in data set (Run to see that we don't; all good to go)
Masters$Rounds.Y.D[is.na(Masters$Rounds.Y.D)]
Masters$Avg.Driving.Distance.Y.D[is.na(Masters$Avg.Driving.Distance.Y.D)]
Masters$Total.Drives.Y.D[is.na(Masters$Total.Drives.Y.D)]
Masters$Driving.Accuracy.Y.D[is.na(Masters$Driving.Accuracy.Y.D)]
Masters$Greens.in.Reg.Y.D[is.na(Masters$Greens.in.Reg.Y.D)]
The missing values in Total.3.Putts and Avg..3.Putts were set to 0 because they were missing only when a player did not three-putt any hole during the whole tournament.
The missing values for Rounds.Y.D and Total.Drives.Y.D were replaced with the variable’s mean (computed without the missing values). Setting them to 0 did not make much sense and would have distorted the distribution of each variable, potentially skewing the data. Using the mean is more reasonable, since these golfers most likely played many rounds that may not have been professional.
Lastly, the missing values for Avg.Driving.Distance.Y.D, Driving.Accuracy.Y.D, and Greens.in.Reg.Y.D were replaced with the distance, accuracy, and greens in regulation percentage that were calculated for the tournament. To simplify this step, I created a user-defined function that replaces a missing value with the value of another variable in the same row. For example, the missing values in Avg.Driving.Distance.Y.D were given the values of Avg..Driving.Distance from the same rows. This gives a better approximation of what the observations are most likely to be than setting the value to 0 (which may cause unnecessary skewness) or to the mean. The mean would not be an ideal approximation here because a given player may tend to be below or above the average for the particular variable.
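The three strategies described above can be seen side by side on a tiny example. This is a sketch on made-up values (the data frame toy is hypothetical), and the third step uses a vectorized assignment in place of a row-by-row loop.

```r
# Made-up example: three players, each missing one statistic
toy <- data.frame(
  Total.3.Putts = c(2, NA, 1),
  Rounds.Y.D = c(30, 40, NA),
  Avg..Driving.Distance = c(294.6, 305.8, 299.9),
  Avg.Driving.Distance.Y.D = c(299.6, 303.2, NA)
)

# Strategy 1: NA means "none recorded", so replace it with 0
toy$Total.3.Putts[is.na(toy$Total.3.Putts)] <- 0

# Strategy 2: replace NA with the rounded mean of the observed values
toy$Rounds.Y.D[is.na(toy$Rounds.Y.D)] <- round(mean(toy$Rounds.Y.D, na.rm = TRUE))

# Strategy 3: replace NA with the same row's value from a related column
is_missing <- is.na(toy$Avg.Driving.Distance.Y.D)
toy$Avg.Driving.Distance.Y.D[is_missing] <- toy$Avg..Driving.Distance[is_missing]

# One check for the whole data frame instead of one check per column
anyNA(toy)
```

anyNA() returns a single TRUE or FALSE, so it is a quick way to confirm that no missing values remain.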
The following code will change certain variables into factors:
Note: The results will be hidden.
# A quick look at the variable's observations and change to a factor
sort(Masters$Num..of.1.Putts)
Masters$Num..of.1.Putts <- factor(Masters$Num..of.1.Putts,
levels = c(min(Masters$Num..of.1.Putts):max(Masters$Num..of.1.Putts)),
ordered = TRUE)
is.factor(Masters$Num..of.1.Putts)
Masters$Num..of.1.Putts
sort(Masters$Total.2.Putts)
Masters$Total.2.Putts <- factor(Masters$Total.2.Putts,
levels = c(min(Masters$Total.2.Putts):max(Masters$Total.2.Putts)),
ordered = TRUE)
is.factor(Masters$Total.2.Putts)
Masters$Total.2.Putts
sort(Masters$Total.3.Putts)
Masters$Total.3.Putts <- factor(Masters$Total.3.Putts,
levels = c(min(Masters$Total.3.Putts):max(Masters$Total.3.Putts)),
ordered = TRUE)
is.factor(Masters$Total.3.Putts)
Masters$Total.3.Putts
sort(Masters$Rounds.Y.D)
Masters$Rounds.Y.D <- factor(Masters$Rounds.Y.D,
levels = c(min(Masters$Rounds.Y.D):max(Masters$Rounds.Y.D)),
ordered = TRUE)
is.factor(Masters$Rounds.Y.D)
Masters$Rounds.Y.D
These variables were changed into factors because they behave more like categories than continuous measurements. For example, numerical summaries such as the mean would not make sense because a player cannot putt the ball 2.5 times.
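To illustrate what the conversion changes, here is a short sketch on a made-up vector of counts (putts is hypothetical). Once the counts are an ordered factor, summary() returns a frequency table instead of a five-number summary, while ordering comparisons still work.

```r
# Made-up counts for six players
putts <- c(2, 0, 3, 1, 3, 1)

# Ordered factor with one level per integer between the minimum and maximum
putts_f <- factor(putts, levels = min(putts):max(putts), ordered = TRUE)

# For a factor, summary() gives the number of players at each level
level_counts <- summary(putts_f)
level_counts

# Levels are ordered, so comparisons between observations are allowed
putts_f[1] < putts_f[3]

# If a numeric summary is needed later, convert back through character
recovered <- as.numeric(as.character(putts_f))
```

Converting through as.character() matters: calling as.numeric() directly on a factor returns the internal level codes rather than the original counts.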
A condensed version of the final dataset is shown below:
head(Masters, 10)
## Age Placement Avg..Strokes.per.Round Avg..Driving.Distance Driving.Accuracy
## 1 43 1 68.75 294.6 62.50
## 2 25 2 69.00 305.8 62.50
## 3 28 2 69.00 313.6 69.64
## 4 34 2 69.00 308.0 60.71
## 5 36 5 69.25 294.8 73.21
## 6 31 5 69.25 296.5 66.07
## 7 33 5 69.25 283.1 83.93
## 8 29 5 69.25 316.3 67.86
## 9 27 9 69.50 299.9 64.29
## 10 24 9 69.50 308.4 76.79
## Greens.in.Reg..Perc. Num..of.1.Putts X1.Putt.Perc. Avg..2.Putt Total.2.Putts
## 1 80.56 26 36.11 11.00 44
## 2 70.83 32 44.44 9.25 37
## 3 73.61 27 37.50 9.50 38
## 4 70.83 30 41.67 10.00 40
## 5 65.28 37 51.39 8.50 34
## 6 70.83 33 45.83 9.50 38
## 7 68.06 34 47.22 9.00 36
## 8 66.67 31 43.06 9.75 39
## 9 61.11 36 50.00 8.75 35
## 10 70.83 26 36.11 10.50 42
## Total.3.Putts Avg..3.Putts Total.Putts Low.Total.Putts Avg..Putts
## 1 2 0.50 120 28 30.00
## 2 3 0.75 115 28 28.75
## 3 5 1.25 118 29 29.50
## 4 1 0.25 113 24 28.25
## 5 0 0.00 105 25 26.25
## 6 1 0.25 112 26 28.00
## 7 2 0.50 112 25 28.00
## 8 1 0.25 112 26 28.00
## 9 0 0.00 106 24 26.50
## 10 4 1.00 122 29 30.50
## Birdie.to.Bogey Total.Birdies.or.Better Total.Bogeys.or.Worse
## 1 2.44 22 9
## 2 2.08 25 12
## 3 2.33 21 9
## 4 3.40 17 5
## 5 4.25 17 4
## 6 2.71 19 7
## 7 2.29 16 7
## 8 2.83 17 6
## 9 2.11 19 9
## 10 2.33 14 6
## Scoring.Avg..Before.Cut Par.3.Scoring.Avg. Par.4.Scoring.Avg.
## 1 69.0 2.75 3.98
## 2 69.0 2.94 4.03
## 3 68.5 3.19 3.95
## 4 69.0 2.94 4.03
## 5 68.5 2.94 3.93
## 6 68.5 2.88 4.00
## 7 71.5 3.06 3.93
## 8 70.5 3.06 4.03
## 9 73.0 2.94 3.98
## 10 69.5 3.00 3.95
## Par.5.Scoring.Avg. Rounds.Y.D Avg.Driving.Distance.Y.D Total.Drives.Y.D
## 1 4.50 25 299.6 40
## 2 4.25 41 303.2 52
## 3 4.19 33 309.0 44
## 4 4.25 35 304.6 56
## 5 4.56 31 291.2 40
## 6 4.44 33 298.4 44
## 7 4.44 37 284.6 68
## 8 4.19 45 310.0 76
## 9 4.50 35 309.1 56
## 10 4.50 43 305.8 72
## Driving.Accuracy.Y.D Greens.in.Reg.Y.D
## 1 64.64 75.56
## 2 60.45 71.20
## 3 61.32 69.63
## 4 54.59 70.83
## 5 65.88 63.43
## 6 63.90 70.93
## 7 69.25 68.79
## 8 59.45 66.53
## 9 56.70 70.31
## 10 63.75 69.03
Tiger Woods Winning the 2019 Masters Tournament
Each variable in the final dataset is described and explained below using basic graphs and summary statistics:
summary(Masters$Age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 19.00 27.00 30.00 31.68 36.00 61.00
par(mfrow = c(2,1))
hist(Masters$Age,
main = "Histogram for Age per Player (n = 65)",
xlab = "Age",
ylab = "Probability Frequency",
ylim = range(0 , 0.08),
border = "black",
col = "lightsteelblue",
las = 1,
labels = TRUE,
prob = TRUE)
lines(density(Masters$Age), col = 'red', lwd = 1.5)
boxplot(Masters$Age,
main = "Boxplot for Age per Player (n = 65)",
notch = TRUE,
col = "lightsteelblue",
horizontal = TRUE)
text(x=fivenum(Masters$Age), labels=fivenum(Masters$Age), y=1.4)
The variable Age is represented in the graphs above. The histogram shows that the variable is skewed to the right, ranging from 19 to 61. The boxplot and the summary function show that the variable’s interquartile range spans 27 to 36, with a median of 30. The variable also contains one outlier at 61, which contributes to the right skew of the histogram.
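The outlier at 61 can be checked against the common rule of 1.5 times the interquartile range. The sketch below plugs in the quartiles reported by summary() for Age. Note that boxplot() computes its whiskers from hinges rather than from these quartiles, so this should be read as an approximation of what the plot does.

```r
# Quartiles and extremes taken from the summary() output for Age
q1 <- 27
q3 <- 36
youngest <- 19
oldest <- 61

# Interquartile range and the 1.5 * IQR fences
iqr <- q3 - q1
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# A value beyond a fence is flagged as a potential outlier
oldest > upper_fence
youngest < lower_fence
```

Only the maximum age falls outside the fences, which agrees with the single point drawn beyond the whisker.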
summary(Masters$Placement)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 17.00 32.00 31.37 49.00 62.00
par(mfrow = c(2,1))
hist(Masters$Placement,
main = "Histogram of Leaderboard Placements (n = 65)",
xlab = "Leaderboard Placement",
ylab = "Probability Frequency",
ylim = range(0 , 0.03),
border = "black",
col = "lightsteelblue",
las = 1,
labels = TRUE,
prob = TRUE)
lines(density(Masters$Placement), col = 'red', lwd = 2)
boxplot(Masters$Placement,
main = "Boxplot of Leaderboard Placements (n = 65)",
notch = TRUE,
col = "lightsteelblue",
horizontal = TRUE)
text(x=fivenum(Masters$Placement), labels=fivenum(Masters$Placement), y=1.4)
The variable Placement (leaderboard placement) is represented in the graphs above. The histogram shows that the variable is roughly uniform, ranging from 1 to 62. The boxplot and the summary function show that the variable’s interquartile range spans 17 to 49, with a median of 32. Note that this variable contains no outliers.
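Because the same histogram and boxplot code is repeated for every numeric variable, it could be wrapped in a helper. The function below is a hypothetical sketch (plot_numeric and its arguments are not part of the original report), demonstrated on a made-up vector.

```r
# Hypothetical helper: histogram with density curve above a labelled boxplot
plot_numeric <- function(x, label, notch = TRUE) {
  n <- sum(!is.na(x))
  old_par <- par(mfrow = c(2, 1))
  on.exit(par(old_par))  # restore the previous layout when the function ends

  hist(x,
       main = paste0("Histogram for ", label, " (n = ", n, ")"),
       xlab = label, ylab = "Density",
       border = "black", col = "lightsteelblue", las = 1, prob = TRUE)
  lines(density(x, na.rm = TRUE), col = "red", lwd = 1.5)

  boxplot(x,
          main = paste0("Boxplot for ", label, " (n = ", n, ")"),
          notch = notch, col = "lightsteelblue", horizontal = TRUE)
  text(x = fivenum(x), labels = fivenum(x), y = 1.4)

  invisible(fivenum(x))  # return the five-number summary without printing it
}

# Made-up ages; the sample is small, so the notch is turned off
ages <- c(19, 24, 27, 28, 30, 31, 33, 36, 43, 61)
plot_numeric(ages, "Age per Player", notch = FALSE)
```

With a helper like this, each variable would need only one call, and a change to colors or labels would be made in one place.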
summary(Masters$Avg..Strokes.per.Round)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 68.75 70.25 71.25 71.25 72.25 74.00
par(mfrow = c(2,1))
hist(Masters$Avg..Strokes.per.Round,
main = "Histogram for the Average Strokes per Round (n = 65)",
xlab = "Average Strokes per Round",
ylab = "Probability Frequency",
ylim = range(0 , 0.5),
border = "black",
col = "lightsteelblue",
las = 1,
labels = TRUE,
prob = TRUE)
lines(density(Masters$Avg..Strokes.per.Round), col = 'red', lwd = 1.5)
boxplot(Masters$Avg..Strokes.per.Round,
main = "Boxplot for the Average Strokes per Round (n = 65)",
notch = TRUE,
col = "lightsteelblue",
horizontal = TRUE)
text(x=fivenum(Masters$Avg..Strokes.per.Round),
labels=fivenum(Masters$Avg..Strokes.per.Round), y=1.4)
The variable Average Strokes per Round is represented in the graphs above. The histogram shows that the variable is somewhat multimodal, ranging from about 69 to 74 strokes per round. The boxplot and the summary function show that the variable’s interquartile range spans 70.25 to 72.25, with a median of 71.25. Note that this variable contains no outliers.
summary(Masters$Avg..Driving.Distance)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 272.6 292.6 297.6 298.2 302.5 316.3
par(mfrow = c(2,1))
hist(Masters$Avg..Driving.Distance,
main = "Histogram for the Average Driving Distance per Player (n = 65)",
xlab = "Average Driving Distance",
ylab = "Probability Frequency",
ylim = range(0 , 0.08),
border = "black",
col = "lightsteelblue",
las = 1,
labels = TRUE,
prob = TRUE)
lines(density(Masters$Avg..Driving.Distance), col = 'red', lwd = 1.5)
boxplot(Masters$Avg..Driving.Distance,
main = "Boxplot for the Average Driving Distance per Player (n = 65)",
notch = TRUE,
col = "lightsteelblue",
horizontal = TRUE)
text(x=fivenum(Masters$Avg..Driving.Distance),
labels=fivenum(Masters$Avg..Driving.Distance), y=1.4)
The variable Average Driving Distance per player is represented in the graphs above. The histogram shows that the variable is bimodal with some skewness to the left, ranging from about 273 to 316. The boxplot and the summary function show that the variable’s interquartile range spans about 293 to 303, with a median of around 298. This variable contains one outlier around 273, which contributes to the skewness.
summary(Masters$Driving.Accuracy)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 50.00 62.50 69.64 68.38 73.21 85.71
par(mfrow = c(2,1))
hist(Masters$Driving.Accuracy,
main = "Histogram for the Driving Accuracy per Player (n = 65)",
xlab = "Driving Accuracy",
ylab = "Probability Frequency",
ylim = range(0 , 0.08),
border = "black",
col = "lightsteelblue",
las = 1,
labels = TRUE,
prob = TRUE)
lines(density(Masters$Driving.Accuracy), col = 'red', lwd = 1.5)
boxplot(Masters$Driving.Accuracy,
main = "Boxplot for the Driving Accuracy per Player (n = 65)",
notch = TRUE,
col = "lightsteelblue",
horizontal = TRUE)
text(x=fivenum(Masters$Driving.Accuracy),
labels=fivenum(Masters$Driving.Accuracy), y=1.4)
The variable Driving Accuracy is represented in the graphs above. The histogram shows that the variable is roughly normal and symmetric, ranging from 50 to about 86. The boxplot and the summary function show that the variable’s interquartile range spans about 62 to 73, with a median of around 70. Note that this variable contains no outliers.
summary(Masters$Greens.in.Reg..Perc.)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 55.56 62.50 66.67 66.15 69.44 80.56
par(mfrow = c(2,1))
hist(Masters$Greens.in.Reg..Perc.,
main = "Histogram for the Percentage of Greens in Regulation per Player (n = 65)",
xlab = "Percentage of Greens in Regulation",
ylab = "Probability Frequency",
ylim = range(0 , 0.10),
border = "black",
col = "lightsteelblue",
las = 1,
labels = TRUE,
prob = TRUE)
lines(density(Masters$Greens.in.Reg..Perc.), col = 'red', lwd = 1.5)
boxplot(Masters$Greens.in.Reg..Perc.,
main = "Boxplot for the Percentage of Greens in Regulation per Player (n = 65)",
notch = TRUE,
col = "lightsteelblue",
horizontal = TRUE)
text(x=fivenum(Masters$Greens.in.Reg..Perc.), labels=fivenum(Masters$Greens.in.Reg..Perc.), y=1.4)
The variable Greens in Regulation Percentage is represented in the graphs above. The histogram shows that the variable is roughly normal and symmetric, ranging from about 56 to 81. The boxplot and the summary function show that the variable’s interquartile range spans about 62 to 69, with a median of around 67. Note that this variable contains one outlier at about 80.6, which makes the histogram look slightly skewed to the right.
summary(Masters$Num..of.1.Putts)
## 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37
## 1 1 1 2 4 1 2 6 9 4 3 7 5 6 5 3 2 0 1 2
plot(Masters$Num..of.1.Putts,
main = "Bar Chart for the Number of 1 Putts (n = 65)",
xlab = "Number of 1 Putts",
ylab = "Frequency",
border = "black",
col = "lightsteelblue",
las = 1)
The variable Number of 1 Putts is represented in the graph above. The bar chart shows that the most common value was 26 one-putts, recorded by 9 of the 65 players. The bar chart and the summary statistic show that the variable ranges from 18 to 37.
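The most common value and the observed range can be read directly off the frequency table. The sketch below re-enters the counts printed by summary() for Num..of.1.Putts by hand, so the vectors values and counts are a transcription rather than a query of the dataset.

```r
# Levels and counts transcribed from the summary() output for Num..of.1.Putts
values <- 18:37
counts <- c(1, 1, 1, 2, 4, 1, 2, 6, 9, 4, 3, 7, 5, 6, 5, 3, 2, 0, 1, 2)

# Most common value (the mode) and the share of players who recorded it
mode_value <- values[which.max(counts)]
mode_share <- max(counts) / sum(counts)

# Smallest and largest values that were actually observed
observed_range <- range(values[counts > 0])

mode_value
round(mode_share, 3)
observed_range
```

The mode accounts for well under half of the players, so it is the most frequent value rather than a majority.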
summary(Masters$X1.Putt.Perc.)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 25.00 34.72 38.89 38.68 43.06 51.39
par(mfrow = c(2,1))
hist(Masters$X1.Putt.Perc.,
main = "Histogram for the Percentage of 1 Putts (n = 65)",
xlab = "Percentage of 1 Putts",
ylab = "Probability Frequency",
ylim = range(0 , 0.10),
border = "black",
col = "lightsteelblue",
las = 1,
labels = TRUE,
prob = TRUE)
lines(density(Masters$X1.Putt.Perc.), col = 'red', lwd = 1.5)
boxplot(Masters$X1.Putt.Perc.,
main = "Boxplot for the Percentage of 1 Putts (n = 65)",
notch = TRUE,
col = "lightsteelblue",
horizontal = TRUE)
text(x=fivenum(Masters$X1.Putt.Perc.), labels=fivenum(Masters$X1.Putt.Perc.), y=1.4)
The variable Percentage of 1 Putts is represented in the graphs above. The histogram shows that the variable is unimodal and somewhat symmetric, ranging from 25 to about 51. The boxplot and the summary function show that the variable’s interquartile range spans about 35 to 43, with a median of around 39. Note that this variable contains no outliers.
summary(Masters$Avg..2.Putt)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.25 9.50 10.25 10.21 11.00 13.00
par(mfrow = c(2,1))
hist(Masters$Avg..2.Putt,
main = "Histogram for the Average 2 Putts per Player (n = 65)",
xlab = "Average 2 Putts",
ylab = "Probability Frequency",
ylim = range(0 , 0.5),
border = "black",
col = "lightsteelblue",
las = 1,
labels = TRUE,
prob = TRUE)
lines(density(Masters$Avg..2.Putt), col = 'red', lwd = 1.5)
boxplot(Masters$Avg..2.Putt,
main = "Boxplot for the Average 2 Putts per Player (n = 65)",
notch = TRUE,
col = "lightsteelblue",
horizontal = TRUE)
text(x=fivenum(Masters$Avg..2.Putt), labels=fivenum(Masters$Avg..2.Putt), y=1.4)
The variable Average Number of 2 Putts per player is represented in the graphs above. The histogram shows that the variable is unimodal and somewhat symmetric, ranging from about 8 to 13. The boxplot and the summary function show that the variable’s interquartile range spans 9.5 to 11, with a median of around 10 two-putts per round. Note that this variable contains no outliers.
summary(Masters$Total.2.Putts)
## 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52
## 1 4 2 2 4 5 8 4 8 5 5 4 5 2 5 0 0 0 0 1
plot(Masters$Total.2.Putts,
main = "Bar Chart for the Number of 2 Putts (n = 65)",
xlab = "Number of 2 Putts",
ylab = "Frequency",
border = "black",
col = "lightsteelblue",
las = 1)
The variable Number of 2 Putts is represented in the graph above. The bar chart shows that the most common values were 39 and 41 total two-putts, with 8 players each. The bar chart and the summary statistic show that the variable ranges from 33 to 52, with the value 52 possibly being an outlier.
summary(Masters$Total.3.Putts)
## 0 1 2 3 4 5 6 7
## 5 16 11 18 6 5 2 2
plot(Masters$Total.3.Putts,
main = "Bar Chart for the Number of 3 Putts (n = 65)",
xlab = "Number of 3 Putts",
ylab = "Frequency",
border = "black",
col = "lightsteelblue",
las = 1)
The variable Number of 3 Putts is represented in the graph above. The bar chart shows that most players (45 of 65) had between 1 and 3 total three-putts. The bar chart and the summary statistic show that the variable ranges from 0 to 7.
summary(Masters$Avg..3.Putts)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2500 0.7500 0.6423 0.7500 1.7500
par(mfrow = c(2,1))
hist(Masters$Avg..3.Putts,
main = "Histogram for the Average 3 Putts per player (n = 65)",
xlab = "Average 3 Putts",
ylab = "Frequency",
border = "black",
col = "lightsteelblue",
las = 1)
boxplot(Masters$Avg..3.Putts,
main = "Boxplot for the Average 3 Putts per Player (n = 65)",
notch = TRUE,
col = "lightsteelblue",
horizontal = TRUE)
## Warning in bxp(list(stats = structure(c(0, 0.25, 0.75, 0.75, 1.5), .Dim =
## c(5L, : some notches went outside hinges ('box'): maybe set notch=FALSE
text(x=fivenum(Masters$Avg..3.Putts),
labels=fivenum(Masters$Avg..3.Putts), y=1.4)
The variable Average Number of 3 Putts per player is represented in the graphs above. The histogram shows that the variable is unimodal and somewhat skewed to the right, ranging from 0 to 1.75 three-putts per round. The boxplot and the summary function show that the variable’s interquartile range spans 0.25 to 0.75, with a median of 0.75 three-putts per round. Note that this variable contains one outlier at 1.75 three-putts per round, which causes the slight skewness.
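The warning printed with this boxplot can be explained with a short calculation. According to the documentation for boxplot.stats(), notches extend to the median plus or minus 1.58 times the interquartile range divided by the square root of n. The sketch below plugs in the hinges shown in the warning text, so it is an approximation worked by hand rather than output from the report.

```r
# Hinges and median taken from the warning message for Avg..3.Putts
lower_hinge <- 0.25
med <- 0.75
upper_hinge <- 0.75
n <- 65

# Half-width of the notch around the median
half_width <- 1.58 * (upper_hinge - lower_hinge) / sqrt(n)
notch <- c(med - half_width, med + half_width)
notch

# The upper end of the notch lies beyond the upper hinge
notch[2] > upper_hinge
```

Because the median equals the upper hinge, any positive notch width pushes the notch outside the box, which is why R suggests setting notch = FALSE for this variable.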
summary(Masters$Total.Putts)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 105.0 114.0 116.0 117.3 121.0 131.0
par(mfrow = c(2,1))
hist(Masters$Total.Putts,
main = "Histogram for the Total Putts per Player (n = 65)",
xlab = "Total Putts per Player",
ylab = "Frequency",
border = "black",
col = "lightsteelblue",
las = 1)
boxplot(Masters$Total.Putts,
main = "Boxplot for the Total Putts per Player (n = 65)",
notch = TRUE,
col = "lightsteelblue",
horizontal = TRUE)
text(x=fivenum(Masters$Total.Putts), labels=fivenum(Masters$Total.Putts), y=1.4)
The variable, Total Putts per player, is represented in the graphs above. The histogram shows that the variable is unimodal and somewhat symmetric, centered around 117, ranging from 105 to 131 total putts. The boxplot, as well as the summary function, shows that the variable’s interquartile range runs from around 114 to 121, with a median of around 116 total putts. Note that this variable contains no outliers.
summary(Masters$Low.Total.Putts)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 24.0 25.0 26.0 26.8 28.0 31.0
par(mfrow = c(2,1))
hist(Masters$Low.Total.Putts,
main = "Histogram for the Lowest Amount of Putts per Player (n = 65)",
xlab = "Lowest Amount of Putts per Player",
ylab = "Probability Frequency",
ylim = range(0 , 0.35),
border = "black",
col = "lightsteelblue",
las = 1,
labels = TRUE,
prob = TRUE)
lines(density(Masters$Low.Total.Putts), col = 'red', lwd = 1.5)
boxplot(Masters$Low.Total.Putts,
main = "Boxplot for the Lowest Amount of Putts per Player (n = 65)",
notch = TRUE,
col = "lightsteelblue",
horizontal = TRUE)
text(x=fivenum(Masters$Low.Total.Putts), labels=fivenum(Masters$Low.Total.Putts), y=1.4)
The variable, Lowest Amount of Putts, is represented in the graphs above. The histogram shows that the variable is bimodal and somewhat symmetric, with peaks around 25 and 28. The variable ranges from 24 to 31 putts. The boxplot, as well as the summary function, shows that the variable’s interquartile range runs from around 25 to 28, with a median of around 26 putts. Note that this variable contains no outliers.
summary(Masters$Avg..Putts)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 26.25 28.50 29.00 29.32 30.25 32.75
par(mfrow = c(2,1))
hist(Masters$Avg..Putts,
main = "Histogram for the Average Amount of Putts per Round (n = 65)",
xlab = "Average Amount of Putts per Player",
ylab = "Probability Frequency",
ylim = range(0 , 0.5),
border = "black",
col = "lightsteelblue",
las = 1,
labels = TRUE,
prob = TRUE)
lines(density(Masters$Avg..Putts), col = 'red', lwd = 1.5)
boxplot(Masters$Avg..Putts,
main = "Boxplot for the Average Amount of Putts per Round (n = 65)",
notch = TRUE,
col = "lightsteelblue",
horizontal = TRUE)
text(x=fivenum(Masters$Avg..Putts), labels=fivenum(Masters$Avg..Putts), y=1.4)
The variable, Average Amount of Putts per round, is represented in the graphs above. The histogram shows that the variable is unimodal, centered around 29, and ranges from around 26 to 33 putts per round. The boxplot, as well as the summary function, shows that the variable’s interquartile range runs from around 28.5 to 30.25, with a median of around 29 putts per round. Note that this variable contains no outliers.
summary(Masters$Birdie.to.Bogey)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.65 0.94 1.31 1.66 1.70 15.00
par(mfrow = c(2,1))
hist(Masters$Birdie.to.Bogey,
main = "Histogram for the Birdie to Bogey Ratio per Player (n = 65)",
xlab = "Birdie to Bogey Ratio Per Player",
ylab = "Probability Frequency",
ylim = range(0 , 0.8),
border = "black",
col = "lightsteelblue",
las = 1,
labels = TRUE,
prob = TRUE)
lines(density(Masters$Birdie.to.Bogey), col = 'red', lwd = 1.5)
boxplot(Masters$Birdie.to.Bogey,
main = "Boxplot for the Birdie to Bogey Ratio per Player (n = 65)",
notch = TRUE,
col = "lightsteelblue",
horizontal = TRUE)
The variable, Birdie to Bogey Ratio, is represented in the graphs above. The histogram shows that the variable is unimodal and skewed to the right, centered around 1. The variable ranges from 0.65 to 15 birdies per bogey. The boxplot, as well as the summary function, shows that the variable’s interquartile range runs from around 0.94 to 1.7, with a median of around 1.3 birdies per bogey. This variable contains a few outliers, with ratios around 4 and the highest being 15. These outliers contribute to the skewness in the histogram.
summary(Masters$Total.Birdies.or.Better)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.00 14.00 16.00 15.92 18.00 25.00
par(mfrow = c(2,1))
hist(Masters$Total.Birdies.or.Better,
main = "Histogram for the Total Birdies (or Better) per Player (n = 65)",
xlab = "Total Birdies or Better",
ylab = "Probability Frequency",
ylim = range(0 , 0.20),
border = "black",
col = "lightsteelblue",
las = 1,
labels = TRUE,
prob = TRUE)
lines(density(Masters$Total.Birdies.or.Better), col = 'red', lwd = 1.5)
boxplot(Masters$Total.Birdies.or.Better,
main = "Boxplot for the Total Birdies (or Better) per Player (n = 65)",
notch = TRUE,
col = "lightsteelblue",
horizontal = TRUE)
text(x=fivenum(Masters$Total.Birdies.or.Better),
labels=fivenum(Masters$Total.Birdies.or.Better), y=1.4)
The variable, Total Birdies or Better, is represented in the graphs above. The histogram shows that the variable is unimodal and somewhat symmetric, centered around 16, ranging from 9 to 25 total birdies or better. The boxplot, as well as the summary function, shows that the variable’s interquartile range runs from around 14 to 18, with a median of around 16 total birdies or better. Note that this variable contains one outlier, which has the value of 25 total birdies (or better). This outlier may cause a slight skewness to the right, but not much.
summary(Masters$Total.Bogeys.or.Worse)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 10.00 13.00 12.23 15.00 21.00
par(mfrow = c(2,1))
hist(Masters$Total.Bogeys.or.Worse,
main = "Histogram for the Total Bogeys (or Worse) per Player (n = 65)",
xlab = "Total Bogeys or Worse",
ylab = "Probability Frequency",
ylim = range(0 , 0.18),
border = "black",
col = "lightsteelblue",
las = 1,
labels = TRUE,
prob = TRUE)
lines(density(Masters$Total.Bogeys.or.Worse), col = 'red', lwd = 1.5)
boxplot(Masters$Total.Bogeys.or.Worse,
main = "Boxplot for the Total Bogeys (or Worse) per Player (n = 65)",
notch = TRUE,
col = "lightsteelblue",
horizontal = TRUE)
text(x=fivenum(Masters$Total.Bogeys.or.Worse),
labels=fivenum(Masters$Total.Bogeys.or.Worse), y=1.4)
The variable, Total Bogeys or Worse, is represented in the graphs above. The histogram shows that the variable is unimodal and skewed to the left, centered around 13, ranging from 1 to 21 total bogeys or worse. The boxplot, as well as the summary function, shows that the variable’s interquartile range runs from around 10 to 15, with a median of around 13 total bogeys or worse. Note that this variable contains one outlier, which has the value of 1 bogey (or worse). This outlier may cause the slight skewness in the histogram.
summary(Masters$Scoring.Avg..Before.Cut)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 68.50 70.50 71.50 71.38 72.50 73.50
par(mfrow = c(2,1))
hist(Masters$Scoring.Avg..Before.Cut,
main = "Histogram for the Scoring Average Before the Cut (n = 65)",
xlab = "Scoring Average Before Cut",
ylab = "Probability Frequency",
ylim = range(0 , 0.45),
border = "black",
col = "lightsteelblue",
las = 1,
labels = TRUE,
prob = TRUE)
lines(density(Masters$Scoring.Avg..Before.Cut), col = 'red', lwd = 1.5)
boxplot(Masters$Scoring.Avg..Before.Cut,
main = "Boxplot for the Scoring Average Before the Cut (n = 65)",
notch = TRUE,
col = "lightsteelblue",
horizontal = TRUE)
text(x=fivenum(Masters$Scoring.Avg..Before.Cut),
labels=fivenum(Masters$Scoring.Avg..Before.Cut), y=1.4)
The variable, Scoring Average Before the Cut, is represented in the graphs above. The histogram shows that the variable is multimodal, with peaks around 69, 71, and 73. The variable ranges from 68.5 to 73.5 strokes per round. The boxplot, as well as the summary function, shows that the variable’s interquartile range runs from around 70.5 to 72.5, with a median of around 71.5 strokes per round. Note that this variable contains no outliers.
summary(Masters$Par.3.Scoring.Avg.)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.690 2.940 3.000 3.023 3.130 3.310
par(mfrow = c(2,1))
hist(Masters$Par.3.Scoring.Avg.,
main = "Histogram for the Par 3 Scoring Average (n = 65)",
xlab = "Par 3 Scoring Average",
ylab = "Frequency",
border = "black",
col = "lightsteelblue",
las = 1)
boxplot(Masters$Par.3.Scoring.Avg.,
main = "Boxplot for the Par 3 Scoring Average (n = 65)",
notch = TRUE,
col = "lightsteelblue",
horizontal = TRUE)
text(x=fivenum(Masters$Par.3.Scoring.Avg.),
labels=fivenum(Masters$Par.3.Scoring.Avg.), y=1.4)
The variable, Par 3 Scoring Average, is represented in the graphs above. The histogram shows that the variable is unimodal and somewhat symmetric, centered around 3 strokes, with a spread of around 2.6 to 3.4 strokes per par 3. The boxplot, as well as the summary function, shows that the variable’s interquartile range runs from around 2.9 to 3.1, with a median of around 3 strokes per par 3. Note that this variable contains no outliers.
summary(Masters$Par.4.Scoring.Avg.)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.930 4.030 4.100 4.104 4.180 4.380
par(mfrow = c(2,1))
hist(Masters$Par.4.Scoring.Avg.,
main = "Histogram for the Par 4 Scoring Average (n = 65)",
xlab = "Par 4 Scoring Average",
ylab = "Frequency",
border = "black",
col = "lightsteelblue",
las = 1)
boxplot(Masters$Par.4.Scoring.Avg.,
main = "Boxplot for the Par 4 Scoring Average (n = 65)",
notch = TRUE,
col = "lightsteelblue",
horizontal = TRUE)
text(x=fivenum(Masters$Par.4.Scoring.Avg.),
labels=fivenum(Masters$Par.4.Scoring.Avg.), y=1.4)
The variable, Par 4 Scoring Average, is represented in the graphs above. The histogram shows that the variable is unimodal and somewhat symmetric, centered around 4.1, with a spread of around 3.9 to 4.4 strokes per par 4. The boxplot, as well as the summary function, shows that the variable’s interquartile range runs from around 4 to 4.2, with a median of around 4.1 strokes per par 4. Note that this variable contains no outliers.
summary(Masters$Par.5.Scoring.Avg.)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.190 4.380 4.500 4.538 4.690 4.940
par(mfrow = c(2,1))
hist(Masters$Par.5.Scoring.Avg.,
main = "Histogram for the Par 5 Scoring Average (n = 65)",
xlab = "Par 5 Scoring Average",
ylab = "Frequency",
border = "black",
col = "lightsteelblue",
las = 1)
boxplot(Masters$Par.5.Scoring.Avg.,
main = "Boxplot for the Par 5 Scoring Average (n = 65)",
notch = TRUE,
col = "lightsteelblue",
horizontal = TRUE)
text(x=fivenum(Masters$Par.5.Scoring.Avg.),
labels=fivenum(Masters$Par.5.Scoring.Avg.), y=1.4)
The variable, Par 5 Scoring Average, is represented in the graphs above. The histogram shows that the variable is unimodal and somewhat symmetric, centered around 4.5, with a spread of around 4.2 to 5 strokes per par 5. The boxplot, as well as the summary function, shows that the variable’s interquartile range runs from around 4.4 to 4.7, with a median of around 4.5 strokes per par 5. Note that this variable contains no outliers.
summary(Masters$Rounds.Y.D)
## 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46
## 1 1 0 1 1 2 0 0 2 0 3 0 4 2 5 2 3 0 8 3 3 4 6 1 3 1
## 47 48 49 50 51 52 53
## 0 1 3 0 1 1 3
plot(Masters$Rounds.Y.D,
main = "Bar Chart for the Number of Rounds Played before Tournament (n = 65)",
xlab = "Number of Rounds",
ylab = "Frequency",
border = "black",
col = "lightsteelblue",
las = 1)
The variable, Number of Rounds played before the tournament, is represented in the graph above. The bar chart shows that the most common value was 39 rounds, played by 8 of the 65 players. The bar chart, as well as the summary statistic, shows that the variable has values that range from 21 to 53.
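The frequency table above can be queried directly for its most common value. This is a minimal sketch with the counts retyped from the `summary()` output; the vector names are illustrative, and with the full data the same result should come from `table(Masters$Rounds.Y.D)`.

```r
# Values and counts retyped from the summary() output above
rounds <- 21:53
counts <- c(1, 1, 0, 1, 1, 2, 0, 0, 2, 0, 3, 0, 4, 2, 5, 2, 3, 0, 8, 3, 3, 4,
            6, 1, 3, 1,          # 21 through 46
            0, 1, 3, 0, 1, 1, 3) # 47 through 53

# Sanity check: one count per value, and the counts cover all 65 players
stopifnot(length(counts) == length(rounds), sum(counts) == 65)

modal_rounds <- rounds[which.max(counts)]   # most frequent number of rounds
modal_share  <- max(counts) / sum(counts)   # share of players at that value

modal_rounds
round(modal_share, 3)
```

The modal value accounts for only a small share of the field, so it is better described as the most common value than as what most players did.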
summary(Masters$Avg.Driving.Distance.Y.D)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 272.6 292.7 298.3 297.9 303.2 312.6
par(mfrow = c(2,1))
hist(Masters$Avg.Driving.Distance.Y.D,
main = "Histogram for the Avg. Driving Distance before Tournament (n = 65)",
xlab = "Average Driving Distance",
ylab = "Probability Frequency",
ylim = range(0 , 0.07),
border = "black",
col = "lightsteelblue",
las = 1,
labels = TRUE,
prob = TRUE)
lines(density(Masters$Avg.Driving.Distance.Y.D), col = 'red', lwd = 1.5)
boxplot(Masters$Avg.Driving.Distance.Y.D,
main = "Boxplot for the Avg. Driving Distance before Tournament (n = 65)",
notch = TRUE,
col = "lightsteelblue",
horizontal = TRUE)
text(x=fivenum(Masters$Avg.Driving.Distance.Y.D),
labels=fivenum(Masters$Avg.Driving.Distance.Y.D), y=1.4)
The variable, Average Driving Distance before the tournament, is represented in the graphs above. The histogram shows that the variable is unimodal and skewed to the left, centered around 300, with a spread of around 270 to 310 yards. The boxplot, as well as the summary function, shows that the variable’s interquartile range runs from around 293 to 303, with a median of around 298 yards per drive. This variable contains one outlier, which has the value of around 273 yards. This outlier may cause the slight skewness in the histogram.
summary(Masters$Total.Drives.Y.D)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 28.00 48.00 60.18 60.18 68.00 88.00
par(mfrow = c(2,1))
hist(Masters$Total.Drives.Y.D,
main = "Histogram for the Total Amount of Drives before Tournament (n = 65)",
xlab = "Total Amount of Drives",
ylab = "Probability Frequency",
ylim = range(0 , 0.05),
border = "black",
col = "lightsteelblue",
las = 1,
labels = TRUE,
prob = TRUE)
lines(density(Masters$Total.Drives.Y.D), col = 'red', lwd = 1.5)
boxplot(Masters$Total.Drives.Y.D,
main = "Boxplot for the Total Amount of Drives before Tournament (n = 65)",
notch = TRUE,
col = "lightsteelblue",
horizontal = TRUE)
text(x=fivenum(Masters$Total.Drives.Y.D), labels=fivenum(round(Masters$Total.Drives.Y.D)), y=1.4)
The variable, Total Amount of Drives before the tournament, is represented in the graphs above. The histogram shows that the variable is unimodal and slightly skewed to the left. The distribution is centered around 60, with a spread of around 30 to 90 total drives. The boxplot, as well as the summary function, shows that the variable’s interquartile range runs from around 48 to 68, with a median of around 60 total drives. Note that this variable contains no outliers.
summary(Masters$Driving.Accuracy.Y.D)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 49.26 59.45 62.19 63.31 66.95 85.71
par(mfrow = c(2,1))
hist(Masters$Driving.Accuracy.Y.D,
main = "Histogram for the Driving Accuracy before Tournament (n = 65)",
xlab = "Driving Accuracy",
ylab = "Probability Frequency",
ylim = range(0 , 0.1),
border = "black",
col = "lightsteelblue",
las = 1, #rotates the y axis numbers to horizontal
labels = TRUE,
prob = TRUE)
lines(density(Masters$Driving.Accuracy.Y.D), col = 'red', lwd = 1.5)
boxplot(Masters$Driving.Accuracy.Y.D,
main = "Boxplot for the Driving Accuracy before Tournament (n = 65)",
notch = TRUE,
col = "lightsteelblue",
horizontal = TRUE)
text(x=fivenum(Masters$Driving.Accuracy.Y.D),
labels=fivenum(round(Masters$Driving.Accuracy.Y.D)), y=1.4)
The variable, Driving Accuracy before the tournament, is represented in the graphs above. The histogram shows that the variable is unimodal and somewhat symmetric with a slight skewness to the right. The distribution is centered around 62, with a spread of around 50 to 90 percent accuracy. The boxplot, as well as the summary function, shows that the variable’s interquartile range runs from around 59 to 67, with a median of around 62 percent accuracy. Note that this variable contains a couple of outliers ranging from around 80 to 86. These outliers may cause the histogram distribution to be skewed to the right.
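The skewness calls made from the histograms can be cross-checked against the `summary()` output by comparing each mean with its median. This is a rough rule of thumb, not a formal skewness measure, and it can mislead for multimodal data. A minimal sketch using means and medians copied from the output in this section (the data frame is illustrative):

```r
# Mean and median copied from the summary() output shown in this section
stats <- data.frame(
  variable = c("Birdie.to.Bogey", "Total.Bogeys.or.Worse",
               "Avg.Driving.Distance.Y.D", "Driving.Accuracy.Y.D"),
  mean   = c(1.66, 12.23, 297.9, 63.31),
  median = c(1.31, 13.00, 298.3, 62.19)
)

# Rule of thumb: a mean above the median hints at right skew, below at left skew
stats$direction <- ifelse(stats$mean > stats$median, "right", "left")

print(stats)
```

For these four variables the direction of the mean-median gap agrees with the skewness read off the histograms.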
summary(Masters$Greens.in.Reg.Y.D)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 59.72 65.15 67.80 67.72 70.05 75.76
par(mfrow = c(2,1))
hist(Masters$Greens.in.Reg.Y.D,
main = "Histogram for the Greens in Regulation before Tournament (n = 65)",
xlab = "Percentage of Greens in Regulation",
ylab = "Probability Frequency",
ylim = range(0 , 0.16),
border = "black",
col = "lightsteelblue",
las = 1,
labels = TRUE,
prob = TRUE)
lines(density(Masters$Greens.in.Reg.Y.D), col = 'red', lwd = 1.5)
boxplot(Masters$Greens.in.Reg.Y.D,
main = "Boxplot for the Greens in Regulation before Tournament (n = 65)",
notch = TRUE,
col = "lightsteelblue",
horizontal = TRUE)
text(x=fivenum(Masters$Greens.in.Reg.Y.D), labels=fivenum(Masters$Greens.in.Reg.Y.D), y=1.4)
The variable, Greens in Regulation before the tournament, is represented in the graphs above. The histogram shows that the variable is unimodal and somewhat symmetric, centered around 68, with a spread of around 60 to 75 percent greens in regulation. The boxplot, as well as the summary function, shows that the variable’s interquartile range runs from around 65 to 70, with a median of around 68 percent. Note that this variable contains no outliers.
boxplot(Masters$Greens.in.Reg..Perc. ~ Masters$Num..of.1.Putts,
main = "Greens in Reg. Percentage of different Num. of 1 Putts",
xlab = "Number of 1 Putts",
ylab = "Greens in Regulation Percentage",
col = c("lightsteelblue"))
In this bivariate analysis, I looked at the Greens in Regulation Percentage and the Number of 1 Putts for the 2019 Masters tournament. To analyze the relationship between the quantitative variable and the grouping variable, I compared the boxplots for each number of 1 putts and found that there is not much association between the two variables. However, there seems to be a slight negative trend: as the number of 1 putts increases, the greens in regulation percentage appears to trend down. One possible reason is that a player who misses the green in regulation usually has to chip from around the green. The chip may leave the player closer to the hole for their first putt, giving them a better chance of making it.
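The slight downward trend read from the boxplots could be put into numbers with per-group medians and a rank correlation. The sketch below uses hypothetical toy values, not the Masters data, and assumes the number of 1 putts is stored as a factor, as the count variables summarized earlier appear to be.

```r
# Hypothetical toy values; not the Masters dataset
one_putts <- factor(c(20, 22, 22, 25, 27, 27, 30))  # assumed stored as a factor
gir_pct   <- c(72, 70, 68, 66, 65, 61, 60)          # greens in regulation, percent

# as.numeric() on a factor returns level codes (1, 2, ...), so convert through
# as.character() to recover the actual counts
one_putts_num <- as.numeric(as.character(one_putts))

# Median greens-in-regulation percentage for each number of one-putts
medians <- tapply(gir_pct, one_putts, median)

# Spearman rank correlation; a negative value indicates a downward trend
rho <- cor(one_putts_num, gir_pct, method = "spearman")

medians
rho
```

A rank correlation is used here because it only assumes a monotone trend, which matches how the boxplots are being read.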
plot(Masters$Age, Masters$Avg..Driving.Distance,
main = "Age vs. Driving Distance", xlab = "Age", ylab = "Avg. Driving Distance", col = "black")
with(Masters, lines(loess.smooth(Age, Avg..Driving.Distance), col = 'red'))
cor(Masters$Age, Masters$Avg..Driving.Distance)
## [1] -0.4247037
In this bivariate analysis of the variables Age and Average Driving Distance, I used a scatterplot with a loess smoothing line to analyze their relationship. In the scatterplot, the two variables have a moderate negative linear correlation (r ≈ -0.42). This correlation may be affected by the outlier of the older player. Intuitively, however, the correlation makes sense because people usually lose strength as they age. Other “noise” factors that the plot does not show could also be involved, such as height, weight, or injuries.
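One way to judge whether a correlation of this size could plausibly be noise is the standard t test for a Pearson correlation. This test is not part of the original analysis; the sketch below assumes all 65 players have both Age and driving distance recorded and that the usual assumptions of the test (independent players, roughly bivariate normal data) are acceptable.

```r
r <- -0.4247037  # cor(Age, Avg..Driving.Distance) reported above
n <- 65          # assumed number of complete (Age, distance) pairs

# t statistic for H0: the true correlation is zero, with n - 2 degrees of freedom
t_stat  <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_value <- 2 * pt(-abs(t_stat), df = n - 2)  # two-sided p-value

c(t = t_stat, p = p_value)
```

With the full data, `cor.test(Masters$Age, Masters$Avg..Driving.Distance)` runs the same test directly and also reports a confidence interval.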
library(ggplot2)
library(dplyr) # provides %>% and mutate()
# Split Placement at its median: finishes better than the median are "Top"
Masters_new <- Masters %>%
mutate(Placement = ifelse(Placement < median(Placement), 0, 1)) %>%
mutate(Placement = factor(Placement, levels = c(0, 1), labels = c("Top", "Bottom"), ordered = TRUE))
ggplot(data = Masters_new,
mapping = aes(x = Driving.Accuracy, y = Age, color = Placement)) +
geom_point() +
scale_color_manual(breaks = c("Top", "Bottom"), values = c("darkblue", "darkgoldenrod1")) +
ggtitle("Driving Accuracy and Age for Top Half and Bottom Half Finishes") +
xlab("Driving Accuracy") + ylab("Age")
cor(Masters$Driving.Accuracy, Masters$Age)
## [1] -0.02405185
In this multivariate analysis, I analyzed the variables Driving Accuracy, Age, and Placement (Top Half/Bottom Half) to see if there was an association between the three variables. To recode Placement, I split the players into a top half (placement below the median) and a bottom half (placement at or above the median), stored as a two-level factor. Then I created a scatterplot of Driving Accuracy against Age, color-coded by the two Placement categories. Overall, Driving Accuracy and Age show essentially no linear correlation (r ≈ -0.02). Within the halves of the leaderboard, the top half appears to have a weak-to-moderate negative linear correlation, while the bottom half appears to have a weak-to-moderate positive one. Even with this contrast, the correlations are too weak to conclude that the variables are related.
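The per-half correlations described above were judged by eye from the scatterplot. A minimal sketch of how the median split and the within-group correlations could be computed, using a hypothetical toy data frame (not the Masters data) and an illustrative `Half` column:

```r
# Hypothetical toy data; not the Masters dataset
toy <- data.frame(
  Placement        = c(1, 2, 3, 4, 5, 5, 7, 8, 9),
  Driving.Accuracy = c(60, 65, 70, 75, 58, 62, 66, 71, 74),
  Age              = c(40, 35, 33, 28, 24, 27, 30, 31, 38)
)

# Same rule as the report: finishes strictly better than the median are "Top",
# so players tied at the median fall into "Bottom"
cutoff <- median(toy$Placement)
toy$Half <- factor(ifelse(toy$Placement < cutoff, "Top", "Bottom"),
                   levels = c("Top", "Bottom"))

# Correlation of Driving Accuracy and Age within each half, and overall
r_by_half <- sapply(split(toy, toy$Half),
                    function(d) cor(d$Driving.Accuracy, d$Age))
r_overall <- cor(toy$Driving.Accuracy, toy$Age)

table(toy$Half)
r_by_half
r_overall
```

Because ties at the median go to the bottom group, the two halves need not be the same size, which is consistent with the uneven column totals in the frequency table later in this report.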
plot(Masters$Driving.Accuracy, Masters$Avg..Driving.Distance,
main = "Driving Distance to Driving Accuracy Relationship",
xlab = "Driving Accuracy",
ylab = "Avg Driving Distance")
with(Masters, lines(loess.smooth(Driving.Accuracy, Avg..Driving.Distance), col = 'red'))
cor(Masters$Driving.Accuracy, Masters$Avg..Driving.Distance)
## [1] -0.3442147
In this bivariate analysis of the two variables, Driving Accuracy and Average Driving Distance, I used a scatterplot with a loess smoothing line to analyze their relationship. The scatterplot displays a weak-to-moderate negative linear relationship (r ≈ -0.34). Since the relationship is not negligible, it is worth noting for future analyses.
ggplot(data = Masters_new,
mapping = aes(x = Greens.in.Reg..Perc., y = Avg..Driving.Distance, color = Placement)) +
geom_point() + scale_color_manual(breaks = c("Top","Bottom"), values = c("darkblue", "darkgoldenrod1")) +
ggtitle("Greens in Reg. Percentage and Avg. Driving Distance for Different Placements") + xlab("Greens in Regulation Percentage") + ylab("Average Driving Distance")
cor(Masters$Greens.in.Reg..Perc., Masters$Avg..Driving.Distance)
## [1] 0.2003943
In this multivariate analysis, I analyzed the variables Greens in Regulation Percentage, Average Driving Distance, and Placement (top half/bottom half finishes). To do so, I created a scatterplot of Greens in Regulation Percentage against Average Driving Distance and color-coded each observation by whether the player placed in the top half or the bottom half of the leaderboard. Overall, the two variables have a weak positive linear relationship (r ≈ 0.20). However, when they are split by placement, there appear to be two clusters of data. The majority of the bottom half of the leaderboard (gold) hit 55 to 67 percent of greens in regulation while driving the ball 280 to 305 yards on average. In comparison, the majority of the top half of the leaderboard (dark blue) hit 65 to 75 percent of greens in regulation with an average driving distance of around 280 to 310 yards.
Masters_new <- Masters_new %>%
mutate(Total.Birdies.or.Better = ifelse(Total.Birdies.or.Better <= median(Total.Birdies.or.Better),0,1)) %>%
mutate(Total.Birdies.or.Better = factor(Total.Birdies.or.Better, levels = c(0,1), labels = c("Less Birdies","More Birdies"), ordered = TRUE))
table(Masters_new$Total.Birdies.or.Better, Masters_new$Placement)
##
## Top Bottom
## Less Birdies 10 26
## More Birdies 21 8
prop.table(table(Masters_new$Total.Birdies.or.Better, Masters_new$Placement))
##
## Top Bottom
## Less Birdies 0.1538462 0.4000000
## More Birdies 0.3230769 0.1230769
#row proportions
prop.table(table(Masters_new$Total.Birdies.or.Better, Masters_new$Placement), margin = 1)
##
## Top Bottom
## Less Birdies 0.2777778 0.7222222
## More Birdies 0.7241379 0.2758621
In this bivariate analysis, I looked at two variables, Total Birdies or Better and Placement. To analyze them, I recoded each variable into two categories. Placement (as before) was split into the top half and bottom half of the leaderboard, depending on whether the player finished better than the median placement. I then recoded Total Birdies or Better into two categories, less birdies (or better) and more birdies (or better), by assigning 0 to an observation if the original value was at or below the median number of birdies or better (less birdies) and 1 if it was above the median (more birdies). The frequency and relative frequency tables show that the two variables are associated: about 72 percent of the players in the more birdies group finished in the top half of the leaderboard, compared with about 28 percent of the players in the less birdies group. Also, the largest single group of players (26 of 65, or 40 percent) made no more birdies (or better) than the median and finished in the bottom half of the leaderboard.
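The row proportions above can be reproduced from the printed counts alone, and the same 2 × 2 table can be passed to a chi-squared test of independence as a rough check on the association. The test is an addition and was not part of the original analysis; it assumes the 65 players can be treated as independent observations.

```r
# Counts copied from the frequency table above (matrix() fills column by column)
tab <- matrix(c(10, 21, 26, 8), nrow = 2,
              dimnames = list(Birdies   = c("Less Birdies", "More Birdies"),
                              Placement = c("Top", "Bottom")))

# Share of each birdie group finishing in the top or bottom half
row_props <- prop.table(tab, margin = 1)

# Chi-squared test of independence (R applies Yates' correction to 2 x 2 tables)
chi <- chisq.test(tab)

round(row_props, 3)
chi
```

A small p-value from this test would support the reading that birdie group and finishing half are not independent, although splitting both variables at the median discards information that a regression on the original values would keep.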
ggplot(data = Masters_new,
mapping = aes(x = Greens.in.Reg.Y.D, y = Scoring.Avg..Before.Cut, color = (Placement))) + geom_point() + scale_color_manual(breaks = c("Top","Bottom"), values = c("darkblue", "darkgoldenrod1")) + ggtitle("Greens in Reg. (Before Tournament) and Scoring Avg. before the Cut per Placements") + xlab("Greens in Regulation Before Tournament") + ylab("Scoring Average Before Cut")
cor(Masters$Greens.in.Reg.Y.D, Masters$Scoring.Avg..Before.Cut)
## [1] -0.333825
In this multivariate analysis, I created a scatterplot of Greens in Regulation (before the tournament) against Scoring Average Before the Cut, color-coded by Placement (top/bottom half). The correlation function shows that the two variables have a weak-to-moderate negative linear correlation (r ≈ -0.33). After categorizing the observations by placement, there appear to be two clusters of data. The bottom half finishers tended to have more observations below 67 percent greens in regulation before the tournament, as well as a scoring average of over 71 before the cut. In comparison, the top half finishers had only a few more observations over 67 percent greens in regulation before the tournament (not a notable difference) and more observations with a scoring average below 71 before the cut.
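The cluster description above rests on counting points on either side of two cutoffs (67 percent greens in regulation and a scoring average of 71). A minimal sketch of how those counts could be tabulated, using hypothetical toy values and illustrative column names rather than the Masters data:

```r
# Hypothetical toy values; not the Masters dataset
toy <- data.frame(
  Placement = factor(c("Top", "Top", "Top", "Bottom", "Bottom", "Bottom"),
                     levels = c("Top", "Bottom")),
  gir       = c(70, 68, 66, 64, 69, 62),       # greens in regulation, percent
  avg_score = c(70, 70.5, 71.5, 72, 71.5, 73)  # scoring average before the cut
)

# Cross-tabulate each half of the leaderboard against the two cutoffs
gir_tab   <- table(Placement = toy$Placement, GIR.over.67 = toy$gir > 67)
score_tab <- table(Placement = toy$Placement, Under.71 = toy$avg_score < 71)

# Row proportions make halves of different sizes comparable
prop.table(gir_tab, margin = 1)
prop.table(score_tab, margin = 1)
```

Applied to `Masters_new`, tables like these would show how large the difference between the two halves actually is, instead of relying on a visual impression of the scatterplot.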
In this report, I explored different variables from the 2019 Masters Tournament in order to determine if certain variables have relationships. After gathering and cleaning the data, I completed a univariate analysis of each variable in the final data set, using descriptive statistics and graphs in RStudio. I then explored the relationships among the variables using bivariate and multivariate analyses. I was not able to find any strong relationships between the variables that I thought would be associated, and the variables that I did not expect to be related did not appear to have any relationship. However, I was able to find some interesting patterns when adding a third variable to the bivariate analyses. For example, I was able to see different clusters in the scatterplots, along with the trends within each cluster. In the multivariate analysis of greens in regulation percentage and average driving distance by placement (top half finishes/bottom half finishes), two clusters appeared in the scatterplot, showing that bottom half finishers tended to drive the ball slightly shorter than top half finishers while also hitting a smaller percentage of greens in regulation.
After analyzing the dataset, it was clear that some limitations of the data made it difficult to determine relationships. First, the data covered only the 2019 tournament. If the data set had included more years, I would have been able to look at trends over time. Another limitation of my analysis is the number of missing values. Even though I was able to clean the data so that the missing values were handled, some of the filled-in values were approximations rather than exact data. Lastly, the choice of variables is a limitation. There did not seem to be much, if any, correlation between the variables in the dataset. This makes me think that other factors, not included in the dataset, may be more strongly related to some of these variables.
My analysis could be improved by gathering more data and variables, which would allow me to analyze more relationships and determine which factors are associated. It could also be improved by learning more data analysis techniques, which may help me catch something that I overlooked in the initial analysis. The next steps are therefore to gather more data, analyze the relationships, and learn and implement regression techniques to allow for a higher-quality analysis. Gaining more knowledge of the game of golf will also allow for a better analysis. Once these steps are taken, I hope to find key relationships between different variables and use these variables to predict future statistics.